The Open Data Science course requires several skills of statisticians and other data scientists. In this course we learn to code, recode, program, visualize, analyze, and interpret data.


The Analysis, Marja Holm

1.Read the students2014 data into R either from your local folder or from this url. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.

#Read the data into R from the URL
lrn14 <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt")
#the structure of the data
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
#the dimensions of the data
dim(lrn14)
## [1] 166   7

Description of the data

The learning2014 dataset consists of 166 observations of 7 variables: gender (a factor with levels F and M), age and exam points (integers), and the numeric scores attitude, deep, stra and surf, which measure attitude towards statistics and the deep, strategic and surface learning approaches.

Note also that the numeric scores are combination variables, computed as means of several related survey questions, so they fall roughly on a 1-5 scale.

2.Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

2.1. A scatter plot matrix of the variables

# draw a scatter plot matrix of the variables.
pairs(lrn14)
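
The task also asks for numeric summaries of the variables; these can be printed with summary(). A minimal sketch (using a toy two-column stand-in so the snippet is self-contained; in the analysis one would call summary(lrn14)):

```r
# Sketch: summary() gives min, quartiles, mean and max for each variable
toy <- data.frame(age = c(53, 55, 49, 38), points = c(25, 12, 24, 21))
summary(toy)
```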

2.2. More advanced plot

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# a more advanced plot matrix with ggpairs()
p <- ggpairs(lrn14, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

2.3. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them:

Distributions:

Relationships:

3. Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.

# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = lrn14)
# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Interpret the results. Explain and interpret the statistical test related to the model parameters.

The t-test of each coefficient tests the null hypothesis that the true regression coefficient is zero. Attitude has a statistically significant positive relationship with exam points (b = 3.40, t = 5.91, p < .001), whereas stra (p = .117) and surf (p = .466) are not significant.

If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.

Now, we remove variables stra and surf:

# create a regression model with a single explanatory variable
my_model2 <- lm(points ~ attitude, data = lrn14)
# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

4. Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R squared of the model.

As we mentioned earlier, the relationship between exam points and attitude was significant (b = 3.53, p < .001), indicating that when attitude changes by one unit, the average change in exam points is 3.53 points. Note that we now interpret the model from which the nonsignificant variables were removed. The multiple R-squared was R2 = .19, indicating that attitude explained 19% of the variance in exam points.
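
As a sketch of this interpretation, the fitted equation from the summary above is points = 11.64 + 3.53 × attitude, so the predicted difference for a one-unit change in attitude equals the slope:

```r
# Predicted exam points from the fitted model (estimates from the summary above)
predict_points <- function(attitude) 11.6372 + 3.5255 * attitude

# the predicted difference for a one-unit change in attitude equals the slope
predict_points(3) - predict_points(2)  # 3.5255
```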

5. Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.

Note that only the significant explanatory variable, attitude, was included in the model.

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which = c(1:2, 5))

Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.

The linear model assumes that the relationship between the explanatory variable and the target is linear and that the errors are normally distributed with constant variance, independently of each other. The Residuals vs Fitted plot is used to check the constant-variance assumption, the Normal QQ-plot the normality of the errors, and the Residuals vs Leverage plot whether any single observation has an unusually large influence on the model.
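
The same assumptions can also be inspected numerically; a minimal sketch on simulated data (not the course data), using base R's shapiro.test() for residual normality and hatvalues() for leverage:

```r
# Sketch: numeric counterparts of the diagnostic plots on a toy linear model
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
fit <- lm(y ~ x)

# normality of residuals (counterpart of the Normal QQ-plot)
shapiro.test(residuals(fit))

# leverage of each observation (counterpart of Residuals vs Leverage)
max(hatvalues(fit))
```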


Exercise 3, Marja Holm, 10.2.2017

2. Read the joined student alcohol consumption data into R either from your local folder or from the url. Print out the names of the variables in the data and describe the data set briefly, assuming the reader has no previous knowledge of it.

alc <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = "," , header=TRUE)

#the structure of the data
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
#the dimensions of the data
dim(alc)
## [1] 382  35

Describe the data set briefly, assuming the reader has no previous knowledge of it.

This data consists of 382 observations and 35 variables.

Factor variables:

  1. school: student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
  2. sex: student’s sex (binary: ‘F’ - female or ‘M’ - male)
  3. address: student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
  4. famsize: family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
  5. Pstatus: parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
  6. Mjob: mother’s job (nominal, ‘teacher’, ‘health’ care related, civil ‘services’, ‘at_home’ or ‘other’)
  7. Fjob: father’s job (nominal, ‘teacher’, ‘health’ care related, civil ‘services’, ‘at_home’ or ‘other’)
  8. reason: reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  9. nursery: attended nursery school (binary: yes or no)
  10. internet: Internet access at home (binary: yes or no)
  11. guardian: student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
  12. schoolsup: extra educational support (binary: yes or no)
  13. famsup: family educational support (binary: yes or no)
  14. paid: extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  15. activities: extra-curricular activities (binary: yes or no)
  16. higher: wants to take higher education (binary: yes or no)
  17. romantic: with a romantic relationship (binary: yes or no)

Integer variables:

  1. age: student’s age (numeric: from 15 to 22)
  2. Medu: mother’s education (numeric: 0 - none, 1 - primary education, 2 - 5th to 9th grade, 3 - secondary or 4 - higher)
  3. Fedu: father’s education (numeric: 0 - none, 1 - primary education, 2 - 5th to 9th grade, 3 - secondary or 4 - higher)
  4. traveltime: home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  5. studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  6. failures: number of past class failures (numeric: n if 1<=n<3, else 4)
  7. famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  8. freetime: free time after school (numeric: from 1 - very low to 5 - very high)
  9. goout: going out with friends (numeric: from 1 - very low to 5 - very high)
  10. Dalc: workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  11. Walc: weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  12. health: current health status (numeric: from 1 - very bad to 5 - very good)
  13. absences: number of school absences (numeric: from 0 to 93)
  14. G1: first period grade (numeric: from 0 to 20)
  15. G2: second period grade (numeric: from 0 to 20)
  16. G3: final grade (numeric: from 0 to 20, output target)

Numeric:

  1. alc_use: average of workday (Dalc) and weekend (Walc) alcohol use

Logical:

  1. high_use: TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
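
The derivation of alc_use and high_use follows directly from the definitions above; a minimal sketch with toy values:

```r
# Sketch: alc_use is the mean of Dalc and Walc; high_use flags alc_use > 2
Dalc <- c(1, 1, 2, 5)  # toy workday alcohol use (scale 1-5)
Walc <- c(1, 3, 3, 5)  # toy weekend alcohol use (scale 1-5)
alc_use <- (Dalc + Walc) / 2
high_use <- alc_use > 2
alc_use   # 1.0 2.0 2.5 5.0
high_use  # FALSE FALSE TRUE TRUE
```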

3. The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption.

We selected variables that might have relationships with alcohol consumption.

H1. Going out with friends (goout) might have a positive association with alcohol consumption (alc_use).

H2. Weekly study time (studytime) might have a negative association with alcohol consumption (alc_use).

H3. Students’ age might have a positive association with alcohol consumption (alc_use), indicating that older students might consume more alcohol than younger ones.

H4. Number of past class failures (failures) might have a positive association with alcohol consumption (alc_use).

4. Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses.

First, a dataset (named dta) including only the study variables (goout, studytime, age, failures, alc_use, high_use) was created.

#Access the dplyr library
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Choosing what columns are kept for the analysis dataset
keep_columns<-c("goout","studytime","age", "failures", "alc_use", "high_use")
#Creating the analysis dataset that includes these variables
dta <- select(alc, one_of(keep_columns))
str(dta)
## 'data.frame':    382 obs. of  6 variables:
##  $ goout    : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ studytime: int  2 2 2 3 2 2 2 2 2 2 ...
##  $ age      : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ failures : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ alc_use  : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
#Save the analysis dataset in csv format. Before this, the working directory has been set to the Data folder under the IODS-project.
write.table(dta, file="dta.csv")
dta<-read.table("dta.csv")
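
Note that write.table() with its default settings separates values by spaces rather than commas; for a true comma-separated file one could use write.csv(). A minimal sketch with a toy data frame:

```r
# Sketch: saving and re-reading a data frame as a comma-separated file
df <- data.frame(goout = c(4, 3), high_use = c(TRUE, FALSE))
f <- tempfile(fileext = ".csv")
write.csv(df, file = f, row.names = FALSE)
df2 <- read.csv(f)
all.equal(df, df2)  # TRUE
```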

4.1. Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# a more advanced plot matrix with ggpairs()
p <- ggpairs(dta, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

Figure 1. Correlations and distributions

ta0<-xtabs(~alc_use+goout, data=dta)
ta0
##        goout
## alc_use  1  2  3  4  5
##     1   16 49 45 25  9
##     1.5  3 22 29  8  6
##     2    2 13 26 10  7
##     2.5  2  7 13 16  4
##     3    0  4  3 14 11
##     3.5  1  2  4  3  7
##     4    0  0  1  5  3
##     4.5  0  1  0  1  1
##     5    0  1  2  0  6

Table 1. Cross-tabulations: relationships between alcohol use (alc_use) and going out with friends (goout)

ta1<-xtabs(~alc_use+studytime, data=dta)
ta1
##        studytime
## alc_use  1  2  3  4
##     1   22 74 31 17
##     1.5 21 33 10  4
##     2   17 26 13  2
##     2.5 11 25  5  1
##     3   12 18  1  1
##     3.5 11  4  2  0
##     4    4  4  0  1
##     4.5  0  3  0  0
##     5    5  3  0  1

Table 2. Cross-tabulations: relationships between alcohol use (alc_use) and weekly study time (studytime)

ta2<-xtabs(~alc_use+age, data=dta)
ta2
##        age
## alc_use 15 16 17 18 19 20 22
##     1   46 41 26 27  3  1  0
##     1.5  9 24 20 12  3  0  0
##     2    8 14 19 16  1  0  0
##     2.5  8  9 16  8  1  0  0
##     3    4  9  8  8  3  0  0
##     3.5  3  3  5  6  0  0  0
##     4    2  4  2  1  0  0  0
##     4.5  0  0  2  1  0  0  0
##     5    1  3  2  2  0  0  1

Table 3. Cross-tabulations: relationships between alcohol use (alc_use) and age

ta3<-xtabs(~alc_use+failures, data=dta)
ta3
##        failures
## alc_use   0   1   2   3
##     1   127  10   1   6
##     1.5  60   6   0   2
##     2    47   6   4   1
##     2.5  36   3   2   1
##     3    20   4   4   4
##     3.5  10   5   0   2
##     4     6   3   0   0
##     4.5   3   0   0   0
##     5     7   1   0   1

Table 4. Cross-tabulations: relationships between alcohol use and number of past class failures (failures)

library(ggplot2)

alc$alc_use<-as.factor(alc$alc_use)
# initialise a plot of alcohol use and going out
g2 <- ggplot(alc, aes(x = alc_use, y = goout))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + aes(col = alc_use) + ylab("goout") + ggtitle("Going out with friends by alcohol use")

Figure 2. Box plots I

library(ggplot2)

alc$alc_use<-as.factor(alc$alc_use)
# initialise a plot of alcohol use and study time
g2 <- ggplot(alc, aes(x = alc_use, y = studytime))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + aes(col = alc_use) + ylab("studytime") + ggtitle("Students' weekly study time by alcohol use")

Figure 3. Box plots II

library(ggplot2)

alc$alc_use<-as.factor(alc$alc_use)
# initialise a plot of alcohol use and age
g2 <- ggplot(alc, aes(x = alc_use, y = age))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + aes(col = alc_use) + ylab("age") + ggtitle("Student age by alcohol use")

Figure 4. Box plots III

library(ggplot2)

alc$alc_use<-as.factor(alc$alc_use)
# initialise a plot of alcohol use and past class failures
g2 <- ggplot(alc, aes(x = alc_use, y = failures))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + aes(col = alc_use) + ylab("failures") + ggtitle("Number of past class failures by alcohol use")

Figure 5. Box plots IV

4.2. Comment on your findings and compare the results of your exploration to your previously stated hypotheses.

Distributions

Figure 1 showed distributions of the variables:

Dependent variables

Relationship

Figure 1 also showed that our hypotheses were largely supported. There was a positive correlation between goout and alc_use (r = .388) (H1), a negative correlation between studytime and alc_use (r = -.24) (H2), a positive correlation between age and alc_use (r = .156) (H3), and a positive correlation between failures and alc_use (r = .15) (H4).
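
The correlations quoted above are the Pearson correlations shown by ggpairs(), as computed by cor(); a minimal sketch on toy vectors:

```r
# Sketch: Pearson correlation with cor()
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
cor(x, y)  # about 0.85
```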

However, the cross-tabulations and box plots describe the relationships in more detail, showing that there is also a lot of variation.

Table 1 and Figure 2 showed that most students who go out an average amount (level 2 or 3) have quite low alcohol use, while most students who go out at a very high level also have high alcohol use.

Table 2 and Figure 3 showed that most students at study-time level 2 (2 to 5 hours a week) have quite low alcohol use. Note also that students with high alcohol use most often studied at levels 1 or 2, while students studying more than that mostly had low alcohol use, with a few exceptions.

Table 3 and Figure 4 showed that students aged around 16 often have low alcohol use, while the average age is slightly higher among students with higher alcohol use, with a few exceptions.

Table 4 and Figure 5 showed that students without past class failures most often had low alcohol use, but some also had high alcohol use, which explains the rather low correlation (r = .15).

5. Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable. Present and interpret a summary of the fitted model. Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis.

# find the model with glm()
m <- glm(high_use ~ goout+ studytime+age+failures, data = dta, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ goout + studytime + age + failures, 
##     family = "binomial", data = dta)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8036  -0.7813  -0.5332   0.8684   2.5526  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.76790    1.84119  -2.046  0.04071 *  
## goout        0.70987    0.11705   6.064 1.32e-09 ***
## studytime   -0.59020    0.16849  -3.503  0.00046 ***
## age          0.09935    0.11028   0.901  0.36764    
## failures     0.17576    0.17195   1.022  0.30672    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 394.64  on 377  degrees of freedom
## AIC: 404.64
## 
## Number of Fisher Scoring iterations: 4

Interpret a summary of the fitted model.

Above is the summary of the fitted model. As the z-values are significant (large enough in absolute value) only for goout and studytime, only these variables matter (their true regression coefficients differ from zero). In other words, only the relationships of goout and studytime with high_use are significant.

Note also that the standard errors of the four explanatory variables were quite low (below 1), indicating that the coefficient estimates are not too spread out.
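
To make the coefficients concrete, the model predicts the log-odds of high_use; a sketch of the fitted equation using the estimates from the summary above (the chosen input values are just an illustration):

```r
# Sketch: predicted probability of high_use from the fitted logistic model
predict_high_use <- function(goout, studytime, age, failures) {
  # linear predictor on the log-odds scale (coefficients from the summary)
  logit <- -3.76790 + 0.70987 * goout - 0.59020 * studytime +
    0.09935 * age + 0.17576 * failures
  1 / (1 + exp(-logit))  # inverse logit gives a probability
}
predict_high_use(goout = 5, studytime = 1, age = 17, failures = 0)  # about 0.71
```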

# print out the coefficients of the model
coef(m)
## (Intercept)       goout   studytime         age    failures 
## -3.76789872  0.70986928 -0.59020183  0.09935216  0.17575616
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI<-exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR        2.5 %    97.5 %
## (Intercept) 0.02310055 0.0005942399 0.8239749
## goout       2.03372540 1.6262645163 2.5759180
## studytime   0.55421541 0.3939867475 0.7640697
## age         1.10445518 0.8902994937 1.3733059
## failures    1.19214732 0.8494144571 1.6720407
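
The connection between the coefficients and the odds ratios above is simply exponentiation, since a logistic regression coefficient is a change in log-odds:

```r
# Sketch: exponentiating a log-odds coefficient gives the odds ratio
exp(0.70987)   # goout: about 2.03
exp(-0.59020)  # studytime: about 0.55
```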

Interpret the coefficients of the model as odds ratios

We only interpret the significant variables:

The odds ratio for going out with friends (OR = 2.03) was higher than one, indicating a positive relationship: the more a student goes out with friends, the more likely his or her alcohol use is high (TRUE). In other words, each one-unit increase in goout about doubles the odds of high alcohol use.

The odds ratio for study time (OR = 0.55) was lower than one, indicating a negative relationship: the less a student studies, the more likely his or her alcohol use is high. In other words, each one-unit increase in studytime multiplies the odds of high alcohol use by about 0.55.

Compare them to your previously stated hypotheses.

Only hypotheses H1 (going out with friends might have a positive association with alcohol consumption) and H2 (weekly study time might have a negative association with alcohol consumption) were supported when these four explanatory variables were in the same logistic regression model (controlling for each other's effects).

6. Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.

Note that only the variables with a significant relationship with high/low alcohol consumption were kept in the next model.

# fit the model
m <- glm(high_use ~ goout + studytime, data = dta, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
dta <- mutate(dta, probability = probabilities)

# use the probabilities to make a prediction of high_use
dta <- mutate(dta, prediction = probability>0.5)

# tabulate the target variable versus the predictions
table(high_use = dta$high_use, prediction = dta$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   248   22
##    TRUE     70   42
table(high_use = dta$high_use, prediction = dta$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64921466 0.05759162 0.70680628
##    TRUE  0.18324607 0.10994764 0.29319372
##    Sum   0.83246073 0.16753927 1.00000000

Above is the cross-tabulation of the target variable versus the predictions, both as counts and as proportions. The model with the two variables (goout and studytime) correctly classified 65% of all individuals as FALSE (low alcohol use) and 11% as TRUE (high alcohol use); in total, about 76% of the individuals were classified correctly.
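
The task also asks for the training error and a comparison with a simple guessing strategy. A sketch computed from the confusion matrix above: the training error is the share of off-diagonal cases, and a strategy that always guesses FALSE would be wrong for every TRUE case:

```r
# Sketch: training error and a simple guessing baseline from the counts above
tab <- matrix(c(248, 70, 22, 42), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
training_error <- (tab["FALSE", "TRUE"] + tab["TRUE", "FALSE"]) / sum(tab)
training_error  # about 0.241

# always guessing FALSE misclassifies all 112 TRUE cases out of 382
guess_error <- sum(tab["TRUE", ]) / sum(tab)
guess_error   # about 0.293, so the model beats the simple guess
```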

7. Bonus: Perform 10-fold cross-validation on your model. Does your model have better test set performance (smaller prediction error using 10-fold cross-validation) compared to the model introduced in DataCamp (which had about 0.26 error). Could you find such a model?

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(dta$high_use, dta$probability)
## [1] 0.2408377
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = dta, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2408377

Above are the 10-fold cross-validation code and its error as output. The data are divided into 10 subsets; each subset is used once for testing while the others are used for training, so eventually all of the data are used for both training and testing.

My model has about 0.24 training and testing error, and thus my model is better than the model introduced in DataCamp (which had about 0.26 error).
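
To make the loss function above concrete, a toy call (not the course data): a prediction counts as wrong when the probability falls on the other side of 0.5 from the true class:

```r
# Sketch: the loss function flags predictions on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# true classes 1,1,0,0; the 2nd and 4th probabilities are on the wrong side
loss_func(class = c(1, 1, 0, 0), prob = c(0.9, 0.2, 0.1, 0.6))  # 0.5
```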

8. Super-Bonus: Perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors). Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with less predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.

MODEL1

The first model, with a very high number of predictors:

# fit the model
m <- glm(high_use ~ school+ age+ failures + absences + sex+internet+schoolsup+famsup+paid+activities+higher+romantic+Medu+Fedu+traveltime+studytime+famrel+freetime+goout+health+G1+G2+G3, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ school + age + failures + absences + 
##     sex + internet + schoolsup + famsup + paid + activities + 
##     higher + romantic + Medu + Fedu + traveltime + studytime + 
##     famrel + freetime + goout + health + G1 + G2 + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7676  -0.6984  -0.4146   0.5511   2.6452  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -4.47222    2.71628  -1.646 0.099671 .  
## schoolMS       0.07936    0.49215   0.161 0.871888    
## age            0.07185    0.13871   0.518 0.604479    
## failures       0.16623    0.21566   0.771 0.440820    
## absences       0.06836    0.01808   3.781 0.000156 ***
## sexM           0.94855    0.31300   3.031 0.002441 ** 
## internetyes   -0.13830    0.40076  -0.345 0.730020    
## schoolsupyes  -0.18560    0.44390  -0.418 0.675869    
## famsupyes     -0.20794    0.31073  -0.669 0.503368    
## paidyes        0.62269    0.29995   2.076 0.037897 *  
## activitiesyes -0.50000    0.28663  -1.744 0.081090 .  
## higheryes     -0.22285    0.69263  -0.322 0.747653    
## romanticyes   -0.32750    0.31513  -1.039 0.298687    
## Medu          -0.16788    0.17488  -0.960 0.337051    
## Fedu           0.21586    0.17454   1.237 0.216186    
## traveltime     0.47690    0.21231   2.246 0.024689 *  
## studytime     -0.27991    0.19194  -1.458 0.144757    
## famrel        -0.48392    0.15708  -3.081 0.002065 ** 
## freetime       0.20776    0.15361   1.352 0.176228    
## goout          0.78514    0.14134   5.555 2.78e-08 ***
## health         0.15662    0.10301   1.520 0.128420    
## G1            -0.17182    0.09110  -1.886 0.059289 .  
## G2             0.11897    0.11396   1.044 0.296508    
## G3             0.03662    0.08252   0.444 0.657196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 340.34  on 358  degrees of freedom
## AIC: 388.34
## 
## Number of Fisher Scoring iterations: 5
# print out the coefficients of the model
coef(m)
##   (Intercept)      schoolMS           age      failures      absences 
##   -4.47222082    0.07936423    0.07184607    0.16623121    0.06836059 
##          sexM   internetyes  schoolsupyes     famsupyes       paidyes 
##    0.94854610   -0.13830143   -0.18559565   -0.20793732    0.62268537 
## activitiesyes     higheryes   romanticyes          Medu          Fedu 
##   -0.50000158   -0.22284511   -0.32749905   -0.16788349    0.21586182 
##    traveltime     studytime        famrel      freetime         goout 
##    0.47690408   -0.27990842   -0.48391665    0.20775575    0.78514099 
##        health            G1            G2            G3 
##    0.15661541   -0.17181500    0.11897195    0.03662077
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   253   17
##    TRUE     54   58
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.1858639
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2329843

As the training error (0.186) and testing error (0.233) of MODEL1 are lower than those of our first study model with four explanatory variables (training and testing error both 0.241), we move forward.

MODEL2

In this second model, only the significant or near-significant variables are kept: absences, sex, paid, activities, traveltime, famrel, goout, and G1.

# fit the model
m <- glm(high_use ~ absences + sex+paid+activities+traveltime+famrel+goout+G1, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + sex + paid + activities + 
##     traveltime + famrel + goout + G1, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0999  -0.7200  -0.4520   0.6456   2.9602  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.96357    0.89179  -3.323  0.00089 ***
## absences       0.06639    0.01678   3.956 7.63e-05 ***
## sexM           1.20658    0.27761   4.346 1.38e-05 ***
## paidyes        0.46782    0.26894   1.739  0.08195 .  
## activitiesyes -0.57222    0.27011  -2.118  0.03414 *  
## traveltime     0.45079    0.18804   2.397  0.01651 *  
## famrel        -0.45235    0.14481  -3.124  0.00179 ** 
## goout          0.80909    0.13091   6.180 6.39e-10 ***
## G1            -0.04001    0.04087  -0.979  0.32757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 356.39  on 373  degrees of freedom
## AIC: 374.39
## 
## Number of Fisher Scoring iterations: 5
# print out the coefficients of the model
coef(m)
##   (Intercept)      absences          sexM       paidyes activitiesyes 
##   -2.96356793    0.06638932    1.20658118    0.46781964   -0.57222077 
##    traveltime        famrel         goout            G1 
##    0.45079294   -0.45234746    0.80908756   -0.04001193
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   256   14
##    TRUE     58   54
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.1884817
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.1989529

As the testing error of MODEL2 is lower than that of MODEL1, we move forward.

MODEL3

Only the significant variables are retained in the third model: absences, sex, paid, activities, traveltime, famrel, and goout.

# fit the model
m <- glm(high_use ~ absences + sex + paid + activities + traveltime + famrel + goout, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + sex + paid + activities + 
##     traveltime + famrel + goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0675  -0.7223  -0.4531   0.6708   2.8999  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.45743    0.74523  -4.639 3.49e-06 ***
## absences       0.06708    0.01674   4.007 6.16e-05 ***
## sexM           1.18030    0.27583   4.279 1.88e-05 ***
## paidyes        0.44611    0.26714   1.670  0.09494 .  
## activitiesyes -0.59722    0.26886  -2.221  0.02633 *  
## traveltime     0.47011    0.18649   2.521  0.01171 *  
## famrel        -0.45315    0.14489  -3.128  0.00176 ** 
## goout          0.83028    0.12985   6.394 1.61e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 357.35  on 374  degrees of freedom
## AIC: 373.35
## 
## Number of Fisher Scoring iterations: 5
# print out the coefficients of the model
coef(m)
##   (Intercept)      absences          sexM       paidyes activitiesyes 
##   -3.45743452    0.06707767    1.18030243    0.44610789   -0.59722408 
##    traveltime        famrel         goout 
##    0.47011168   -0.45314583    0.83027754
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   255   15
##    TRUE     56   56
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.1858639
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.1963351

As the training and testing errors of MODEL3 are slightly lower than those of MODEL2, MODEL3 might be better.

MODEL4

As all predictors were significant with the exception of paid (p = 0.09), we also tested MODEL4, in which the paid variable was eliminated.

# fit the model
m <- glm(high_use ~ absences + sex + activities + traveltime + famrel + goout, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + sex + activities + traveltime + 
##     famrel + goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9612  -0.7083  -0.4746   0.7186   2.8026  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.16085    0.71564  -4.417 1.00e-05 ***
## absences       0.06634    0.01678   3.954 7.69e-05 ***
## sexM           1.11730    0.27107   4.122 3.76e-05 ***
## activitiesyes -0.58263    0.26744  -2.179  0.02937 *  
## traveltime     0.45011    0.18555   2.426  0.01528 *  
## famrel        -0.45128    0.14424  -3.129  0.00176 ** 
## goout          0.82124    0.12879   6.377 1.81e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 360.17  on 375  degrees of freedom
## AIC: 374.17
## 
## Number of Fisher Scoring iterations: 5
# print out the coefficients of the model
coef(m)
##   (Intercept)      absences          sexM activitiesyes    traveltime 
##   -3.16085491    0.06633946    1.11730440   -0.58262984    0.45011152 
##        famrel         goout 
##   -0.45128170    0.82124209
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   253   17
##    TRUE     58   54
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.1963351
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2041885

As the training and testing errors of MODEL4 are higher than those of MODEL3, MODEL3 might be one of the best models for describing high alcohol use. In this model, absences (number of school absences), sex, paid (extra paid classes), activities (extra-curricular activities), traveltime (home-to-school travel time), famrel (quality of family relationships) and goout (going out with friends) describe high alcohol use.
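The manual pruning above, dropping the least significant predictor one model at a time, can also be automated with AIC-based backward elimination using step(). Note that step() selects by AIC rather than by p-values, so it may retain a marginal variable such as paid. A minimal sketch, illustrated on the built-in mtcars data (standing in for the alc data, with the binary outcome am in place of high_use):

```r
# AIC-based backward elimination; mtcars stands in for the alc data here,
# with the binary outcome am in place of high_use
m_full <- glm(am ~ mpg + wt + hp, data = mtcars, family = "binomial")

# step() drops one term at a time while the AIC improves
m_step <- step(m_full, direction = "backward", trace = 0)

# the retained predictors and the final model's AIC
formula(m_step)
AIC(m_step)
```

The same call on the full MODEL2 formula would reproduce the pruning done by hand above, up to the AIC-versus-p-value difference.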

Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.

predictors<-c(23, 8, 7, 6)
training_error<-c(0.186, 0.188, 0.186, 0.196)
qplot(x=predictors, y=training_error)

predictors<-c(23, 8, 7, 6)
testing_error<-c(0.230, 0.204, 0.199, 0.217)
qplot(x=predictors, y=testing_error)


RStudio Exercise 4, Marja Holm, Analysis exercises, 17.2.2017

2. Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# structure and the dimensions of the data 
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Describe the dataset briefly

crim=per capita crime rate by town.

zn=proportion of residential land zoned for lots over 25,000 sq.ft.

indus=proportion of non-retail business acres per town.

chas=Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox=nitrogen oxides concentration (parts per 10 million).

rm=average number of rooms per dwelling.

age=proportion of owner-occupied units built prior to 1940.

dis=weighted mean of distances to five Boston employment centres.

rad=index of accessibility to radial highways.

tax=full-value property-tax rate per $10,000.

ptratio=pupil-teacher ratio by town.

black=1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat=lower status of the population (percent).

medv=median value of owner-occupied homes in $1000s.

3. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# a more advanced plot matrix with ggpairs()
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 30)))

# draw the plot
p

Figure 1. Distributions.

# MASS, corrplot, tidyverse and Boston dataset are available
library(corrplot)
library(dplyr)
library(MASS)
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)

# print the correlation matrix
cor_matrix %>% round(2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6) 

Distributions

Figure 1 shows the distributions of the variables.

  1. Only rm (average number of rooms per dwelling) and medv (median value of owner-occupied homes in $1000s) were close to normally distributed.

  2. The variables crim (per capita crime rate by town), zn (proportion of residential land zoned), chas (Charles River dummy variable), nox (nitrogen oxides concentration), dis (weighted mean of distances to five Boston employment centres), rad (index of accessibility to radial highways) and lstat (lower status of the population) were skewed to the right, with most observations taking small values and a long right tail; this is especially clear for crim, zn and chas.

  3. The variables age (proportion of owner-occupied units built prior to 1940), ptratio (pupil-teacher ratio by town), and black (proportion of blacks by town) were skewed to the left.

  4. The variable indus (proportion of non-retail business acres per town) has two peaks, i.e. a bimodal distribution.
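These visual impressions can be double-checked numerically with a skewness estimate: a positive value indicates a long right tail (e.g. crim) and a negative value a long left tail (e.g. age, black). A minimal sketch using a hand-rolled moment estimator (skewness here is a helper defined for this example, not a base R function):

```r
library(MASS)  # for the Boston data

# simple moment-based skewness estimator
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / sd(x)^3
}

# right-skewed variables give positive values, left-skewed negative
sapply(Boston[, c("crim", "zn", "age", "ptratio", "black")], skewness)
```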

Relationships

Next, we describe the relationships between the variables using the correlation table and the corrplot above. In general, there were fairly strong correlations between many of the variables; only the chas variable showed low correlations with all the others. Below we describe only some of the strongest correlations. Detailed descriptions of the variables are given earlier in this document; here only the abbreviations are used.
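Rather than reading the strongest relationships off the matrix by eye, they can be extracted programmatically, for example every pair with an absolute correlation above 0.7:

```r
library(MASS)

cor_matrix <- cor(Boston)

# keep each pair only once by restricting to the upper triangle
high <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)

# list the variable pairs together with their correlations
data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = round(cor_matrix[high], 2)
)
```

This picks out, among others, the very strong rad-tax correlation (0.91) and the strong negative lstat-medv correlation (-0.74) visible in the corrplot.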

4. Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset. Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

Standardize the dataset and print out summaries of the scaled data.

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

How did the variables change?

In the scaling, we subtract the column mean from each column and divide the difference by the column's standard deviation. As a result, every variable has mean zero and standard deviation one. Scaling does not by itself make a distribution normal, but putting the variables on a common scale supports the assumptions of linear discriminant analysis (LDA): that the variables are normally distributed (assumption 1) and share the same variance (assumption 2).
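The effect of scale() can be verified by hand on a single column: the scaled value is (x - mean(x)) / sd(x), so every scaled column ends up with mean 0 and standard deviation 1. A quick check on the crim column:

```r
library(MASS)

x <- Boston$crim

# manual standardization: center, then divide by the standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() produces exactly the same values
all.equal(as.vector(scale(x)), manual)  # TRUE

# after scaling: mean 0 (up to floating point error) and sd 1
c(mean = mean(manual), sd = sd(manual))
```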

Create a categorical variable of the crime rate in the Boston dataset. Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# save the scaled crim as scaled_crim
scaled_crim <- boston_scaled$crim

# summary of the scaled_crim
summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000
# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE)

# look at the new factor crime
crime
##   [1] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
##   [5] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39] 
##   [9] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
##  [13] (-0.411,-0.39]  (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
##  [17] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
##  [21] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
##  [25] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
##  [29] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
##  [33] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] [-0.419,-0.411]
##  [37] (-0.411,-0.39]  [-0.419,-0.411] (-0.411,-0.39]  [-0.419,-0.411]
##  [41] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
##  [45] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
##  [49] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  [-0.419,-0.411]
##  [53] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
##  [57] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39] 
##  [61] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
##  [65] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
##  [69] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
##  [73] (-0.411,-0.39]  (-0.411,-0.39]  [-0.419,-0.411] (-0.411,-0.39] 
##  [77] (-0.411,-0.39]  (-0.411,-0.39]  [-0.419,-0.411] (-0.411,-0.39] 
##  [81] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
##  [85] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
##  [89] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
##  [93] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39] 
##  [97] (-0.411,-0.39]  (-0.411,-0.39]  [-0.419,-0.411] [-0.419,-0.411]
## [101] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
## [105] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
## [109] (-0.411,-0.39]  (-0.39,0.00739] (-0.411,-0.39]  (-0.411,-0.39] 
## [113] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
## [117] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
## [121] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39] 
## [125] (-0.411,-0.39]  (-0.411,-0.39]  (-0.39,0.00739] (-0.39,0.00739]
## [129] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [133] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [137] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39]  (-0.39,0.00739]
## [141] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (0.00739,9.92] 
## [145] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [149] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [153] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [157] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [161] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [165] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [169] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [173] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  [-0.419,-0.411]
## [177] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [181] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39] 
## [185] (-0.411,-0.39]  [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [189] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  [-0.419,-0.411]
## [193] (-0.411,-0.39]  [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [197] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [201] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [205] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
## [209] (-0.411,-0.39]  (-0.39,0.00739] (-0.411,-0.39]  (-0.39,0.00739]
## [213] (-0.411,-0.39]  (-0.411,-0.39]  (-0.39,0.00739] (-0.411,-0.39] 
## [217] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39] 
## [221] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [225] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [229] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [233] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [237] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39]  (-0.411,-0.39] 
## [241] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
## [245] (-0.411,-0.39]  (-0.411,-0.39]  (-0.39,0.00739] (-0.411,-0.39] 
## [249] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
## [253] (-0.411,-0.39]  (-0.39,0.00739] [-0.419,-0.411] [-0.419,-0.411]
## [257] [-0.419,-0.411] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [261] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [265] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [269] (-0.39,0.00739] (-0.411,-0.39]  (-0.39,0.00739] (-0.411,-0.39] 
## [273] (-0.411,-0.39]  (-0.411,-0.39]  [-0.419,-0.411] (-0.411,-0.39] 
## [277] (-0.411,-0.39]  [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39] 
## [281] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [285] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [289] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [293] [-0.419,-0.411] (-0.411,-0.39]  [-0.419,-0.411] (-0.411,-0.39] 
## [297] [-0.419,-0.411] (-0.411,-0.39]  [-0.419,-0.411] [-0.419,-0.411]
## [301] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39]  (-0.411,-0.39] 
## [305] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [309] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739]
## [313] (-0.39,0.00739] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39] 
## [317] (-0.39,0.00739] (-0.411,-0.39]  (-0.39,0.00739] (-0.39,0.00739]
## [321] (-0.411,-0.39]  (-0.411,-0.39]  (-0.39,0.00739] (-0.39,0.00739]
## [325] (-0.39,0.00739] (-0.411,-0.39]  (-0.39,0.00739] (-0.411,-0.39] 
## [329] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [333] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [337] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [341] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [345] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [349] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [353] [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411] (-0.411,-0.39] 
## [357] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [361] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [365] (-0.39,0.00739] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [369] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [373] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [377] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [381] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [385] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [389] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [393] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [397] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [401] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [405] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [409] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [413] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [417] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [421] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [425] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [429] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [433] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [437] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [441] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [445] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [449] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [453] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [457] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [461] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [465] (0.00739,9.92]  (-0.39,0.00739] (0.00739,9.92]  (0.00739,9.92] 
## [469] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [473] (-0.39,0.00739] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [477] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92] 
## [481] (0.00739,9.92]  (0.00739,9.92]  (0.00739,9.92]  (-0.39,0.00739]
## [485] (-0.39,0.00739] (-0.39,0.00739] (0.00739,9.92]  (0.00739,9.92] 
## [489] (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39]  (-0.411,-0.39] 
## [493] (-0.411,-0.39]  (-0.411,-0.39]  (-0.39,0.00739] (-0.411,-0.39] 
## [497] (-0.39,0.00739] (-0.39,0.00739] (-0.411,-0.39]  (-0.411,-0.39] 
## [501] (-0.411,-0.39]  [-0.419,-0.411] [-0.419,-0.411] [-0.419,-0.411]
## [505] (-0.411,-0.39]  [-0.419,-0.411]
## 4 Levels: [-0.419,-0.411] (-0.411,-0.39] ... (0.00739,9.92]
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

5. Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.

# linear discriminant analysis, crime rate as the target variable and all other as predictors
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##       0.2524752       0.2425743       0.2524752       0.2524752 
## 
## Group means:
##                          zn      indus        chas        nox         rm
## [-0.419,-0.411]  1.01493086 -0.8879973 -0.15653200 -0.8798237  0.5070858
## (-0.411,-0.39]  -0.05671874 -0.2954931 -0.07145661 -0.5796239 -0.1398876
## (-0.39,0.00739] -0.36785679  0.1578530  0.15226017  0.3411393  0.2034172
## (0.00739,9.92]  -0.48724019  1.0149946 -0.07933396  1.0493743 -0.3988409
##                        age        dis        rad        tax    ptratio
## [-0.419,-0.411] -0.8754368  0.8173022 -0.6778652 -0.7632688 -0.5292187
## (-0.411,-0.39]  -0.2860667  0.4154995 -0.5412349 -0.4801365 -0.1020071
## (-0.39,0.00739]  0.4220071 -0.3820385 -0.4673129 -0.3638687 -0.3113986
## (0.00739,9.92]   0.8103835 -0.8543755  1.6596029  1.5294129  0.8057784
##                      black       lstat        medv
## [-0.419,-0.411]  0.3767440 -0.77599060  0.58741184
## (-0.411,-0.39]   0.3055937 -0.11552933 -0.03130424
## (-0.39,0.00739]  0.1199613 -0.05283233  0.22839004
## (0.00739,9.92]  -0.8453562  0.91525025 -0.70594072
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.17775060  0.72080026 -1.07887719
## indus    0.09183123 -0.11379165  0.16855818
## chas    -0.16330934 -0.08170782  0.11600342
## nox      0.19804250 -0.99872916 -1.24242284
## rm      -0.18237695 -0.12744251 -0.21833484
## age      0.10453765 -0.40829530  0.14033419
## dis     -0.23821427 -0.44335065  0.66704411
## rad      4.61979022  1.11634493 -0.18856259
## tax      0.06364534 -0.22784196  0.76933976
## ptratio  0.15065234 -0.05109953 -0.31208373
## black   -0.11580642  0.04045493  0.09313008
## lstat    0.17508900 -0.14884764  0.38003514
## medv     0.17839712 -0.40729358 -0.07819614
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9688 0.0229 0.0082

In the results above, you can see for example that the first linear discriminant (LD1) explained about 97% (proportion of trace 0.9688) of the between-group (between-class) variance, so it is by far the most important function for distinguishing between the classes. Furthermore, judging by its large LD1 coefficient (4.62), the variable rad seems to be the best linear separator of the crime classes (groups).
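The proportion of trace reported above can be recomputed from the singular values stored in the fitted lda object: each squared singular value divided by their sum. A sketch of the same calculation, illustrated on the built-in iris data since the train set above is not reproduced here:

```r
library(MASS)

# the same mechanics apply to lda.fit above
fit <- lda(Species ~ ., data = iris)

# squared singular values, normalized, give the proportion of trace
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)
```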

# the function for lda biplot arrows for LDA (bi)plot drawing
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

In the LDA biplot, each class is distinguished from the others by color. Based on the coefficients, only rad is a powerful linear separator, as marked by its long arrow.

6. Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##                  predicted
## correct           [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739]
##   [-0.419,-0.411]              14             10               1
##   (-0.411,-0.39]                5             16               7
##   (-0.39,0.00739]               0              9              11
##   (0.00739,9.92]                0              0               1
##                  predicted
## correct           (0.00739,9.92]
##   [-0.419,-0.411]              0
##   (-0.411,-0.39]               0
##   (-0.39,0.00739]              4
##   (0.00739,9.92]              24

Comment on the results.

As the table above shows, the model correctly classifies 14 of 25 observations (56%) in the first group (the interval [-0.419, -0.411]), 16 of 28 (57%) in the second group ((-0.411, -0.39]), 11 of 24 (46%) in the third group ((-0.39, 0.00739]), and 24 of 25 (96%) in the fourth group ((0.00739, 9.92]). Overall, 65 of 102 observations (about 64%) are classified correctly, and the highest crime rate class is clearly the easiest to predict.
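The per-class and overall accuracies can be computed directly from the counts in the cross-tabulation above:

```r
# counts copied from the cross-tabulation above (rows = correct, cols = predicted)
tab <- matrix(c(14, 10,  1,  0,
                 5, 16,  7,  0,
                 0,  9, 11,  4,
                 0,  0,  1, 24), nrow = 4, byrow = TRUE)

# per-class accuracy: diagonal counts divided by row totals
round(diag(tab) / rowSums(tab), 2)   # 0.56 0.57 0.46 0.96

# overall accuracy: correct predictions divided by all predictions
round(sum(diag(tab)) / sum(tab), 2)  # 0.64
```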

(Note: because the train/test split is random, these counts may change slightly each time the document is knitted.)

7. Reload the Boston dataset and standardize the dataset. Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.

Reload the Boston dataset and standardize the dataset.

# access the MASS package
library(MASS)

# load the data
data("Boston")

# center and standardize variables
boston_stand <- scale(Boston)
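A quick sanity check on the standardization: after scale(), every column should have mean 0 and standard deviation 1 (the data is reloaded here so the chunk runs standalone).

```r
library(MASS)
data("Boston")
boston_stand <- scale(Boston)

# the largest column mean is practically zero
max(abs(colMeans(boston_stand)))

# every column standard deviation is one
apply(boston_stand, 2, sd)
```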

Calculate the distances between the observations.

# euclidean distance matrix
dist_eu <- dist(boston_stand)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
# manhattan distance matrix
dist_man <- dist(boston_stand, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4830 12.6100 13.5500 17.7600 48.8600
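As a sanity check of the two metrics, a toy example with two points forming a 3-4-5 right triangle:

```r
# two points: (0, 0) and (3, 4)
m <- rbind(c(0, 0), c(3, 4))

dist(m)                        # euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # manhattan: |3 - 0| + |4 - 0| = 7
```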

Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.

1. Centers = 2

# set a seed so the clustering is reproducible
set.seed(123)

# k-means clustering on the standardized data
km <- kmeans(boston_stand, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_stand, col = km$cluster)

2. Centers = 3

# set a seed so the clustering is reproducible
set.seed(123)

# k-means clustering on the standardized data
km <- kmeans(boston_stand, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_stand, col = km$cluster)

3. Centers = 4

# set a seed so the clustering is reproducible
set.seed(123)

# k-means clustering on the standardized data
km <- kmeans(boston_stand, centers = 4)

# plot the Boston dataset with clusters
pairs(boston_stand, col = km$cluster)

Interpret the results.

The pairs figures, in which the clusters are separated by color, suggest that the optimal number of clusters might be three: when the number of centers was increased from three to four, the data points hardly changed clusters, whereas increasing the centers from two to three clearly reassigned points. Hence, we suggest that the optimal number of clusters is three.
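The visual comparison above can also be checked numerically by cross-tabulating the k = 3 and k = 4 assignments: if the fourth cluster mostly just splits one of the three, three clusters are enough. A sketch (the exact table depends on the seed):

```r
library(MASS)
data("Boston")
boston_stand <- scale(Boston)

set.seed(123)
km3 <- kmeans(boston_stand, centers = 3)
set.seed(123)
km4 <- kmeans(boston_stand, centers = 4)

# rows = k = 3 assignment, columns = k = 4 assignment
table(k3 = km3$cluster, k4 = km4$cluster)
```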

Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters?

Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset.

Based on the results above, we choose centers = 4, i.e. four clusters (one more than the suggested optimum).

# access the MASS package
library(MASS)

# load the data
data("Boston")

# center and standardize variables
boston_stand <- scale(Boston)

set.seed(123)

# determine the maximum number of clusters to try
k_max <- 4

# calculate the total within sum of squares for each k
twcss <- sapply(1:k_max, function(k){kmeans(boston_stand, k)$tot.withinss})

# visualize the results: look for the "elbow" where the drop levels off
plot(1:k_max, twcss, type = 'b')

# k-means clustering with four centers
km <- kmeans(boston_stand, centers = 4)

# plot the Boston dataset with clusters
pairs(boston_stand, col = km$cluster)

Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model.

# linear discriminant analysis, clusters as target classes and all other as predictors
lda2.fit <- lda(km$cluster ~ ., data = Boston)

# print the lda.fit object
lda2.fit
## Call:
## lda(km$cluster ~ ., data = Boston)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.2272727 0.1304348 0.2114625 0.4308300 
## 
## Group means:
##         crim        zn     indus      chas       nox       rm      age
## 1  6.0201918  0.000000 19.295565 0.0000000 0.6589652 6.090009 90.29652
## 2 15.9401909  0.000000 18.470303 0.1818182 0.7104242 5.760379 92.71364
## 3  0.2484426 40.915888  5.130748 0.2149533 0.4694196 6.982925 49.01308
## 4  0.2636922  6.293578  7.560505 0.0000000 0.4943982 6.203284 59.40963
##        dis       rad      tax  ptratio    black     lstat     medv
## 1 2.168761 17.391304 582.0261 19.75652 355.1060 17.020696 17.50783
## 2 1.769233 20.818182 626.8333 19.36515 205.5244 21.173030 15.00000
## 3 5.427185  4.355140 300.8879 16.36449 387.8884  6.797196 32.82617
## 4 4.465165  4.550459 303.0688 18.52018 387.9413 10.643807 22.41193
## 
## Coefficients of linear discriminants:
##                  LD1          LD2           LD3
## crim    -0.021057935 -0.058271581  0.0703771294
## zn      -0.018564727 -0.044961797 -0.0289018272
## indus   -0.200796084  0.043976280 -0.1560182615
## chas     0.169607803 -2.991498755  0.8838096601
## nox     -9.033201420 -3.331966520  2.8710406785
## rm       0.212247438 -0.214962952 -0.9669930466
## age      0.003516101  0.001858370 -0.0093380639
## dis     -0.062398001 -0.075668875  0.0165639237
## rad     -0.075596848  0.059603084 -0.0552930235
## tax     -0.001714967 -0.003425927 -0.0006141385
## ptratio -0.102713176  0.077073413  0.0424108370
## black    0.004680516  0.006401206 -0.0098438564
## lstat   -0.034057442 -0.086790743  0.0015673331
## medv    -0.023878793 -0.103139302  0.0185551449
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7596 0.1768 0.0636

Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution)

# a function to draw arrows on the LDA (bi)plot
lda2.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km$cluster)

# plot the lda results
plot(lda2.fit, dimen = 2, col = classes, pch = classes)
lda2.arrows(lda2.fit, myscale = 1)
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped

Interpret the results. Which variables are the most influential linear separators for the clusters?

In the results above, you can see that the first linear discriminant function (LD1) explains about 76% of the between-cluster variance, so it is the most important function for distinguishing between the clusters. Note also that the second linear discriminant (LD2) explains about 18% of the variance, while LD3 explains relatively little (about 6%). Since the first two discriminants together capture over 90% of the variance, the clusters separate reasonably well in two dimensions.

Furthermore, the variable nox appears to be the most influential linear separator for the clusters on LD1 (and also on LD2); the second most influential is chas on LD2.

In the LDA biplot, each cluster is distinguished from the others by color. As mentioned, based on the coefficients, only nox and chas are influential linear separators, as marked by their long arrows.
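The most influential separator for each discriminant can also be picked out programmatically from the coefficient matrix. A sketch that refits the clustering and the LDA so it runs standalone (the winning variables depend on the clustering):

```r
library(MASS)
data("Boston")

set.seed(123)
km <- kmeans(scale(Boston), centers = 4)
fit <- lda(km$cluster ~ ., data = Boston)

# for each discriminant, report the variable with the largest absolute coefficient
apply(abs(coef(fit)), 2, function(co) names(which.max(co)))
```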

Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points. Adjust the code: add a color argument to the plot_ly() function. Set the color to be the crime classes of the train set.

crimeclass = train$crime
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = crimeclass)

Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?

# the earlier km$cluster has one label per Boston row (506), while the train set
# has 404 rows, so run k-means on the train predictors to get matching labels
set.seed(123)
km_train <- kmeans(model_predictors, centers = 4)

library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = factor(km_train$cluster))

The figures were broadly similar.

  1. The 3D plot where the color is defined by crimeclass shows the groups very clearly, separating them by color. As mentioned above, the group (0.00739, 9.92] is the easiest to differentiate.

  2. In turn, in the 3D plot where the color is defined by the clusters of the k-means, the groups are harder to identify, since the cluster boundaries do not coincide exactly with the crime classes.